perm filename SPINDL.REM[UP,DOC]5 blob
sn#360856 filedate 1978-06-09 generic text, type T, neo UTF8
\|\\; Header for POX
\⊂'0302450;\⊂'1000003;\; Turn on usual special-feature bits, W/OUT SIMULATOR
\M0BAXL30;\M2BASB30;\; Define two fonts that JMC likes to use
\←=6;\λ\; Increase interline spacing
\F2\CSPINDL - A PROGRAM FOR FILE COMPRESSION
\F2\Cby Robert E. Maas (REM)
\F2\C(This file is SPINDL.REM, usually on [UP,DOC])
\F0\J Disk space in the Lab is in short supply and will remain that way
for the foreseeable future. Therefore, two methods of reducing the amount
of space taken by files are hereby offered. The first of these allows
combining several files into one, and the second uses Huffman coding to
further reduce the space taken, in most cases to slightly less than half
its original volume. The two facilities are offered by the same program
called SPINDL.
Each disk file consumes at least a full disk block (2304 words), even if
it contains only one word of actual data. Most such files are text files,
and can now be combined into a "spindle" so that many fewer disk blocks
are consumed. Later, as desired, individual files in the spindle may be
retrieved. Since the system DIrectory command measures files in words
rather than blocks, SPINDL will help the system even if the DIrectory
command shows no reduction of word count.
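For instance (a made-up but representative case), ten small text files of
200 words each are charged ten full blocks, i.e. 23040 words of disk, since
each is rounded up to at least one block. Spindled together, the same 2000
words of text fit in a single 2304-word block (plus whatever small amount
of directory information the spindle itself keeps), roughly a ten-fold
saving in blocks.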
To use this new facility, type the monitor command
R SPINDL
The program will announce itself and ask you for the name of the spindle
file you want to use, then prompt you with a ">" at the left margin. You
type a file name; the default extension is ".SPI". If the file doesn't
already exist, it will be created. The program will now announce the
number of files currently in the spindle, and print a "*" at the left
margin to indicate that it is waiting for a command. If you type a "?"
followed by carriage-return, it will provide a list of legal commands.
The following commands will perform most of the useful functions you will
want:
DIrectory -- lists all the files in the spindle.
Spindle -- copies one or more files from the disk into the spindle.
(Note, the original remains on the disk, so at present the user must
manually delete it after spindling if he is to reduce disk-space
consumption.)
Unspindle -- copies a file from the spindle to the disk.
(Note, original remains in spindle until explicitly deleted.)
DElete -- deletes a file from the spindle (without reclaiming
space).
Bubble -- copies the entire spindle, deleting all wasted space
such as deleted files.
After some of the commands (for example, Spindle, Unspindle, and DElete)
the program will ask you for various items, such as the names of the files
to use, and the prompt for typing each such item will be a ">" at the left
margin. When it is all done, it will prompt "*" at the left margin
indicating it is ready for another top-level command. \.
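For illustration only, a session might look roughly like this (the file
names are hypothetical, and the exact wording of the program's
announcements and prompts may differ in detail):
     R SPINDL
     >MYWORK        -- name of the spindle file (MYWORK.SPI, created if new)
     *Spindle       -- copy a disk file into the spindle
     >ESSAY.TXT     -- name of the disk file to spindle (the original stays on disk)
     *DIrectory     -- list the files now in the spindle
     *              -- ready for another top-level command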
\,
CRUNCH-AND-SPINDLE:
\J Greater data compression at the cost of more computer time is obtained
with the SPINDL command "Crunch". (It uses Huffman coding to reduce the
volume of an English, FAIL, SAIL, etc. file by a further factor of
two.) SPINDL will ask you for auxiliary files called the history tree and
the Huffman tree, but carriage return will use default trees (if the
spindle already contains a crunched file, the default is the same trees as
were used the previous time; otherwise the default is a particular pair of
trees on [UP,REM] which work quite well with most English-language text
such as writeups and essays). Regardless of whether the default trees or some
other trees are used, one copy of each tree actually used will be copied
into your spindle file (if it isn't there already), so that crunching only
achieves a reduction of data if the files being crunched with a particular
pair of trees total at least 1400-3000 words (assuming a 2-to-1
compression and assuming 700-1500 word trees). Naturally, it is also
pointless to spindle a single file without crunching, or to crunch a
single file unless its size exceeds 2.3K. Appendix A explains how to make
special trees, assuming none of the already-existing trees listed in
Appendix B are adequate. (Currently only trees for English exist, but soon
an attempt will be made to create trees for FAIL, SAIL, LISP, and perhaps
LAP and SNOBOL.)
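As a worked example of that break-even point (round numbers, assuming
2-to-1 compression and about 1000 words of trees): a 4000-word file
crunches to about 2000 words, and adding the 1000 words of trees leaves
about 3000 words in the spindle, a net saving of roughly 1000 words. A
2000-word file, on the other hand, would crunch to about 1000 words plus
the same 1000 words of trees, for no net saving at all.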
To uncrunch-and-unspindle, just use the "Unspindle" command and the right
thing will happen automatically: the correct trees will be selected from
the spindle file according to their hash number and used for
uncrunching your file.
Timing (figures for KA/10 processor in brackets, figures for KL/10
processor are in plain text):
Spindle -- [3/4] 1/5 second per thousand words.
Unspindle -- [1/2] 1/10 second per thousand words.
Crunch-by-pages-and-spindle -- [4] 1 sec to load default
trees + [2] 1/2 sec per thousand words.
Uncrunch-by-pages-and-spindle -- [3.3] 3/4 sec to load trees
+ [1] 1/4 sec per thousand words.
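For example, crunching-and-spindling a 10,000-word file would take roughly
1 + 10*(1/2) = 6 seconds of KL/10 time ([4 + 10*2 = 24 seconds] on the
KA/10), versus about 2 seconds ([7.5 seconds]) to merely spindle it without
crunching.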
When should a file be SPINDLed? Our current recommendation is that a file
not expected to be used for 30 days should be spindled, and if not expected
to be used for another 15 days, crunched too. These recommendations are
conservative, and will change in the direction of more prompt spindling
when the KL-10 is working.
Many people will be able to double the effective sizes of their disk
allocations by using these facilities. \.
\,
\F2\CAPPENDIX A
\F0\J At present, the process of making a new history-tree and
Huffman-tree pair is rather inefficient, and if the automatic mode is not
used it is very lengthy and rather a hassle (see the gory details below).
If you already have a list of reversed strings to use, or want to use one
of REM's, skip over steps 1-5 and begin at step 6. If you already have
the two Polish files you want to use for crunching and just want to know
how to access them from SPINDL, you may begin at step 7.
NOTE -- If you are planning to perform all the steps automatically, from
the beginning to the generation of the Huffman code in step 6, without
manual intervention to edit either of the two intermediate files, may I
suggest that you run the whole process through HOTER.DMP[HOT,REM] so that
you won't have to interrupt the process before step 6 in order to initiate
HOTER at that time. With HOTER between your terminal and the PTY the
tasks below are being run on, you have the useful features of (1) saving
the TTY output on a disk file for later perusal, and (2) local output-
suppression while continuing to write on the disk file, so you don't have
to see the whole Huffman code (step 6) being typed out, but you can see
how it's coming along from time to time by toggling the local-suppress
flag.
RU HOTER[HOT,REM] <cr> -- Starts up the PTYJOB hack
<file> <cr> -- whatever output file you want TTY output written on
$ -- tells it you want $ to be converted into ↑C
(if you hit $$ you pass ↑C↑C to the PTY job and it stops while HOTER
copies the "↑C" etc. to TTY and DSK, but if you hit
↑C yourself you stop HOTER while the PTY job continues to run
until its output buffer gets full)
% -- % will be ↑U should you ever want to flush your input line
ctrl-shift-o -- ctrl-shift-o will be passed as ↑O to flush output from
program to PTY
ctrl-shift-l -- ctrl-shift-l will be local ↑O to flush output from PTY
to TTY without affecting output from program to PTY to DSK.
(Of course, the four characters are merely suggestions.
Whatever your favorite funny-characters are... Be sure to wait
until HOTER asks you for each character before you type it.)
STEP 1 -- Create a list of strings occurring most often:
RU CRU1[1,REM] -- Starts up the survey program.
A -- (Optional, "automatic" mode, if you do this, it makes steps 3,4,6
follow after step 1 without intervention on your part, skipping steps 2,5
-- If you do this, you don't have to worry about all the grunge, except
for a few places where a program asks you for a file name or some
parameter to type in.)
B -- Sets the program to Bigram/Trigram mode in which
it scans a file for the strings which occur most often.
The program will ask you for an input file name, then ask you for one
or more parameters which may be defaulted by typing a carriage-return, and
write TOKENS.DAT containing the repeated strings found.
STEP 2 -- Modify the list of strings using your favorite
editor, employing whatever heuristics you have for eliminating
undesired strings or adding strings you believe will be
useful.
(STEP 2 is skipped if running in automatic mode)
STEP 3 -- Reverse the list of strings so that they read
backwards (from a given point in a file back to earlier
characters):
RU CRU1[1,REM]
A -- (Optional, automatic mode for steps 3,4,6)
R -- Sets the program to Reverse mode.
TOKENS.DAT -- Tells it the name of the file to read.
The program will write SNEKOT.DAT, in which each line
consists of a length-of-string character followed by a
string in parentheses.
STEP 4 --
Sort the file into "crossword-puzzle-dictionary" sequence,
i.e. by length, and within each group alphabetically:
R SSORT -- Starts up the String-Sorter program.
SNEKOT.DAT/Q -- Tells it the name of the file to sort,
and sets the quiet flag for file overwrite.
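(For example, the reversed forms of "AN", "AND", and "THE" -- namely
"NA", "DNA", and "EHT" -- would end up in the order NA, DNA, EHT: the
two-character string first, then the two three-character strings in
alphabetical order.)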
STEP 5 -- Modify the list of reversed strings again if
desired. Seeing them alphabetized according to their new
backwards-format may suggest additional heuristics for purging
or adding strings.
(STEP 5 is skipped if running in automatic mode)
You may wish to rename SNEKOT.DAT to some more
permanent name at this time. <FILE> as used below means the
name of this file, or of any similar already-existing file that
you may choose to use.
[If you jumped over steps 1-5, there is a spindle called
SNEKOT.SPI[1,REM] which contains various lists of reversed
strings you may want to use.]
STEP 6 -- Scan your collection of files according to the list
of left-contexts defined by the reversed strings you have
created or found:
RU HOTER[HOT,REM] if you haven't done so yet -- this is where
you will really need it -- see the note before step 1 for details.
If you are sure you won't want to examine the Huffman code, omit
this HOTER part and just do the following directly on your
terminal.
RU CRU1[1,REM]
S -- Sets the program to Scan-by-Reversed-Left-Context
mode.
<FILE> -- The list-of-left-contexts file you
laboriously created.
The program will ask you one at a time for files to
survey according to the list of left-contexts that were read
in, compiling for each left-context a complete '200-word table
of how many times each ASCII character occurs in that
context. When you
are finished with all the files you want to include, type a
blank line instead of a file name.
(In many cases you will just want one file
surveyed, namely the file scanned for repeated strings in step 1.)
The program will now type out all the totals, which
you want to suppress by control-o or escape-o depending on the
type of terminal you are on.
(use CONTROL-SHIFT-O if running through HOTER)
After all that, output will be turned back on by the
program, and it will write out the file HIST.POL, which is a Polish
description of all the left-contexts (typically this file is
between 50 and 200 words).
After that, the program will begin computing an
abbreviated Huffman code for each left-context, and type on
the TTY or PTY the input bits, output bits, compression-ratio,
and a complete description of the Huffman code in nice
readable verbose format. If you are running this through
HOTER you will probably want to do a local ↑O (CONTROL-SHIFT-L) to
suppress output to TTY without affecting the copying of PTY
output to the disk file. (PTYJOB does not provide this
facility, only HOTER.) This process will probably take about
15 minutes because of the volume of data being inefficiently
copied; all the while, the file HUFFS.POL will be written
containing each Huffman code generated (typically about
500-2000 words total).
You now have HIST.POL and HUFFS.POL containing a
complete description of the code to be used.
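Stripped of the file formats above, the underlying idea is: count how
often each character occurs after each listed left-context, then build a
separate Huffman code for each context. The sketch below (in Python,
purely as a modern illustration; it is not the CRU1 or SPINDL
implementation, the "abbreviated" Huffman code mentioned above may differ
in detail, and the file and context names are made up) shows those two
steps:

    import heapq
    from collections import Counter

    def count_by_context(text, contexts):
        # For each character, find the longest listed string that ends just
        # before it (its left-context; "" if none matches) and count the
        # character under that context.  SPINDL stores the contexts
        # reversed, which amounts to the same test.
        tables = {ctx: Counter() for ctx in contexts}
        tables[""] = Counter()
        for i, ch in enumerate(text):
            best = ""
            for ctx in contexts:
                if len(ctx) > len(best) and text.endswith(ctx, 0, i):
                    best = ctx
            tables[best][ch] += 1
        return tables

    def huffman_code(freqs):
        # Build a Huffman code (character -> bit string) from one frequency table.
        heap = [(n, i, ch) for i, (ch, n) in enumerate(freqs.items())]
        heapq.heapify(heap)
        if not heap:
            return {}
        if len(heap) == 1:                    # degenerate one-symbol table
            return {heap[0][2]: "0"}
        serial = len(heap)                    # tie-breaker for equal counts
        while len(heap) > 1:
            n1, _, t1 = heapq.heappop(heap)   # merge the two rarest subtrees
            n2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (n1 + n2, serial, (t1, t2)))
            serial += 1
        code = {}
        def walk(tree, bits):                 # read the codes off the tree
            if isinstance(tree, tuple):
                walk(tree[0], bits + "0")
                walk(tree[1], bits + "1")
            else:
                code[tree] = bits
        walk(heap[0][2], "")
        return code

    # One Huffman code per left-context -- roughly the information that
    # HIST.POL and HUFFS.POL record between them.
    text = open("ESSAY.TXT").read()
    codes = {ctx: huffman_code(tbl)
             for ctx, tbl in count_by_context(text, ["e ", "th", " t"]).items()
             if tbl}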
** If you are running all the above through HOTER, it is now
time to hit ↑C (on a display, hit <CALL>) and type REEnter.
This will close your output file containing the TTY transcript,
which includes the readable version of the Huffman code.
[If you jumped directly here, there are several useful history
and Huffman trees on [UP,REM] with extensions *.HIS and *.HUF
respectively.]
STEP 7 -- When using the Crunch-by-pages-and-spindle command
in SPINDL, specify the files you created in step 6 or the
already-existing *.HIS and *.HUF files you found -- only one
copy of each different tree will be kept in the spindle that
contains files crunched by it.
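For illustration only (the file names are hypothetical, and the exact
prompts and their order may differ from what the program actually types),
the dialogue might look something like:
     *Crunch
     >ESSAY.TXT     -- file to crunch into the spindle
     >MYTEXT.HIS    -- history tree from step 6 (or just <cr> for the default)
     >MYTEXT.HUF    -- Huffman tree from step 6 (or just <cr> for the default)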
\.
\,
\F2\CAPPENDIX B
\F0\J All existing trees are in the disk area [UP,REM].
History trees have extension .HIS and Huffman trees have
extension .HUF. The following table tells which are available
and what files they are good for.
\.
\MAGACL25;\FA\;
*.HIS *.HUF type of file it is for
NOTSEV NOTIC1 English text (surveyed NOTICE[UP,DOC]) (DEFAULT TREES)
NOTSEV SEVEN1 English text (same left-contexts but larger survey)
FAIS1 FAIS1 .FAI source programs (survey [S,SYS] etc.)
FAIS2 FAIS2 .FAI (improved version of FAIS1)
LSPS1 LSPS1 .LSP source programs (survey [TAL,TVR])
SAIS1 SAIS1 .SAI source programs (survey [*,HPM],[*,REM])